Abstract

Languages across the world exhibit Zipf’s law of abbreviation, namely,
more frequent words tend to be shorter. The generalised version of the
law, an inverse relationship between the frequency of a unit and its
magnitude, also holds for the behaviours of other species and the
genetic code. The apparent universality of this pattern in human
language and its ubiquity in other domains call for a theoretical
understanding of its origins. To this end, we generalise the
information-theoretic concept of mean code length as a mean energetic
cost function over the probability and the magnitude of the types of
the repertoire.

We show that the minimisation of that cost function and a negative
correlation between probability and the magnitude of types are
intimately related.

Keywords: Zipf’s law of abbreviation, compression, information theory,
language, animal behaviour. 
1 Introduction


Zipf’s law of abbreviation, the tendency of more frequent words to be shorter
[1], holds in every language for which it has been tested [1, 2, 3, 4, 5, 6, 7, 8, 9]
(Fig. 1 (a)), suggesting that language universals are not necessarily a myth
[?]. A generalised version of the law, i.e. a negative correlation between
the frequency of a type from a repertoire and its magnitude (e.g., its length,
size or duration), has been found in the behaviour of other species
[11, 12, 13, ?] (Fig. 1 (b)) and in the genetic code [15]. This is strong evidence
for a general tendency of more frequent units to be smaller, i.e. less
cost-intensive. The robustness and recurrence of this pattern call for a
theoretical understanding of the mechanisms that give rise to it.

The common interpretation of the law as an indication of the efficiency
of language and animal behaviour [?] suffers from Kirby’s problem
of linkage, i.e. the lack of a strong connection between potential processing
constraints and the proposed universal [?]. Here we address the problem of
linkage for the law of abbreviation with the help of information theory.

Information theory sheds light on the origins of many regularities of
natural language, e.g., duality of patterning [18], Zipf’s law for word
frequencies [19, 20], Clark’s principle of contrast [21], a vocabulary learning
bias in children [21] and the exponential decay of the distribution of
dependency lengths [22, 23]. Those examples suggest that the solutions of
information theory to communication problems can be informative for natural
languages too, even though efforts in information theory research have been
directed towards solving engineering problems, not linguistic problems [24].

Here we investigate the law of abbreviation in the light of the problem of
compression from standard information theory [25, 26]. In this context, the
mean code length is defined as

$$L = \sum_{i=1}^{V} p_i l_i, \tag{1}$$

where $p_i$ and $l_i$ are, respectively, the probability and the length in symbols
of the $i$-th type of a repertoire of size $V$. In the case of human language,
the types could be words, the symbols could be letters and the repertoire
would be a vocabulary. Solving the problem of compression provides word
lengths that minimise $L$ when the $p_i$’s are given. An optimal coding of types
by using strings of symbols (under the wide scheme of uniquely decipherable
codes) satisfies [25]

$$l_i = \lceil -\log_N p_i \rceil, \tag{2}$$

where $N$ is the size of the alphabet used to code the types. Eq. 2 is indeed
a particular case of Zipf’s law of abbreviation.
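
As a minimal illustration of Eqs. 1 and 2, the following Python sketch computes the code lengths $\lceil -\log_N p_i \rceil$ and the resulting mean code length. The function names, the example probabilities and the small tolerance that guards against floating-point rounding are assumptions made for this illustration only.

```python
import math

def optimal_lengths(probs, alphabet_size):
    """Code lengths l_i = ceil(-log_N p_i), as in Eq. 2.

    A tiny tolerance (an assumption of this sketch) avoids rounding a value
    such as 3.0000000000000004 up to 4.
    """
    return [math.ceil(-math.log(p, alphabet_size) - 1e-12) for p in probs]

def mean_code_length(probs, lengths):
    """Mean code length L = sum_i p_i * l_i, as in Eq. 1."""
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical repertoire of four types coded with a binary alphabet.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = optimal_lengths(probs, 2)
L = mean_code_length(probs, lengths)
```

In this example the more probable a type is, the shorter its code, which is the pattern described by the law of abbreviation.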

Eq. 1 can also be interpreted as an energetic cost function where the
cost of every unit is exactly its length. Based on this assumption, we will
generalise the problem of compression in two ways. First, we put forward a
cost function $\Lambda$, i.e.

$$\Lambda = \sum_{i=1}^{V} p_i \lambda_i, \tag{3}$$

where $\lambda_i$ is the energetic cost of the $i$-th type. In his pioneering research, G.
K. Zipf already proposed a particular version of Eq. 3 to explain the origins
of the law of abbreviation using qualitative arguments [1, p. 59]. Following
this line of argument, we address Kirby’s problem of linkage [?] by showing
how the minimisation of $\Lambda$ can produce the law of abbreviation.

We assume that the energetic cost of a unit is a monotonically increasing
function of its length, $\lambda_i = g(l_i)$. For instance, the energy that is needed
to articulate the sounds of a string of length $l_i$ is assumed to increase as $l_i$
increases. Second, we generalise $l_i$ as a magnitude (a positive real number).
This way, $l_i$ can indicate not only the length in syllables of a word [1] or the
number of strokes of a Japanese kanji [6] but also the duration in time of
a vocalisation [26] or the amount of information of a codon that is actually
relevant for coding an amino acid [15]. Durations are important in the case
of human language because words that have the same length in letters, and
even the same number of phonemes, can still have different durations [27]. A
review of costs associated with the length or duration of a unit in human
language and animal behaviour can be found in [26].

Under these two assumptions $\Lambda$ becomes

$$\Lambda = \sum_{i=1}^{V} p_i g(l_i). \tag{4}$$

$\Lambda$ is equivalent to $L$ when $g$ is the identity function. We assume that $g(l)$ is
a strictly monotonically increasing function of $l$. The same assumption has
been made for the cost of a syntactic dependency as a function of its length
in word order models [28].
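
The generalised cost function of Eq. 4 can be sketched in a few lines of Python. The names `mean_cost`, `probs` and `mags`, as well as the quadratic choice of $g$, are hypothetical and serve only to show that any strictly increasing $g$ can be plugged in.

```python
def mean_cost(probs, mags, g=lambda l: l):
    """Mean energetic cost: sum_i p_i * g(l_i), as in Eq. 4.

    With the default identity g, this reduces to the mean code length L.
    """
    return sum(p * g(l) for p, l in zip(probs, mags))

# Hypothetical repertoire: probabilities and magnitudes (e.g. durations).
probs = [0.5, 0.3, 0.2]
mags = [1.0, 2.0, 4.0]

cost_identity = mean_cost(probs, mags)                     # g(l) = l
cost_quadratic = mean_cost(probs, mags, lambda l: l ** 2)  # g(l) = l^2
# Assigning the largest magnitude to the most probable type costs more.
cost_reversed = mean_cost(probs, list(reversed(mags)))
```

With these numbers, giving the shortest magnitude to the most probable type yields a lower cost than the reversed assignment, in line with the argument developed in the following sections.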

Assuming that $g$ is the identity function, an equivalence between the
minimisation of $\Lambda$ and Zipf’s law of abbreviation is suggested by statistical
analyses showing that whenever $\Lambda$ is significantly small, the correlation
between frequency and magnitude is significant (and negative) and vice
versa [26]. Furthermore, theoretical arguments indicate that the law of
abbreviation follows from minimum $\Lambda$ when the empirical distribution of type
lengths and type frequencies is constant and $g$ is the identity function [26].
This is indirect evidence of a connection between compression and the law
of abbreviation.

Here we present direct connections between the minimisation of $\Lambda$ and
a generalised law of abbreviation: (1) a link between the minimisation of $\Lambda$
and the maximisation of the concordance with the law of abbreviation using
Kendall and Pearson correlation, (2) the relationship between optimal coding
and the law of abbreviation through Kendall correlation, and (3) the fact that
$\Lambda$ is inherent to the Pearson correlation. Consequently, our research is in
the spirit of recent studies on the origins of Zipf’s law for word frequencies
through optimisation principles [?].

2 Predicting the law of abbreviation 

A generalised law of abbreviation is often defined simply as a negative
correlation between the frequency of a unit and its magnitude [12, 13, 9]. We
consider three measures of correlation: Pearson correlation ($r$), Spearman
rank correlation ($\rho$) and Kendall rank correlation ($\tau$) [30]. While Pearson
is a measure of linear association, $\rho$ and $\tau$ are measures of both linear and
non-linear association [31, 32]. Alternatives to $r$ are necessary because the
functional dependency between frequency and length is modelled by means
of non-linear functions [3] and the actual $g$ may not be linear.

$r$ and $\rho$ have been used in previous research on the generalised law of
abbreviation [?, 13]. Here we additionally introduce $\tau$ [30]. The
non-parametric approach offered by $\rho$ and $\tau$ allows one to remain agnostic about
the actual functional dependency between frequency and magnitude [?] and
avoids common problems of assuming concrete functions for linguistic laws
[?].

2.1 The relationship between $\Lambda$ and $\tau$

Here we will unravel a strong dependency between the minimisation of $\Lambda$
and the minimisation of Kendall’s $\tau$, a rank measure of correlation between
$p_i$ and $l_i$, under two different conditions: multiset constancy and minimum
$\Lambda$. We have multiset constancy when the multiset of $p_i$’s and the multiset of
$\lambda_i$’s are constant. This condition has been used to test the significance of $\Lambda$
[26], and is the central assumption of correlation tests such as the ones used
to test for the law of abbreviation [34].

$\tau$ is based on the concept of concordant and discordant pairs. In our
case, $(p_i, l_i)$ and $(p_j, l_j)$ are

• Concordant if either $p_i > p_j$ and $l_i > l_j$ or $p_i < p_j$ and $l_i < l_j$.

• Discordant if either $p_i > p_j$ and $l_i < l_j$ or $p_i < p_j$ and $l_i > l_j$.

• Non concordant if $p_i = p_j$ or $l_i = l_j$.

$n_c$ and $n_d$ are defined, respectively, as the number of concordant and
discordant pairs. $n_0$ is the total number of pairs, i.e.

$$n_0 = \frac{V(V-1)}{2}. \tag{5}$$

Then Kendall’s $\tau$ is defined as a normalised difference between $n_c$ and $n_d$,
i.e.

$$\tau = \frac{n_c - n_d}{n_0}. \tag{6}$$

If the agreement with Zipf’s law of abbreviation was perfect, i.e. if we had
only discordant pairs, then we would have $\tau = -1$.
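
The definitions of concordant pairs, discordant pairs and Kendall’s $\tau$ above can be sketched as follows. This is a minimal illustration that assumes $\tau$ is the difference between the numbers of concordant and discordant pairs divided by the total number of pairs $V(V-1)/2$; the function names and the example data are hypothetical.

```python
from itertools import combinations

def pair_counts(probs, mags):
    """Return (n_c, n_d), the numbers of concordant and discordant pairs."""
    n_c = n_d = 0
    for (p_i, l_i), (p_j, l_j) in combinations(list(zip(probs, mags)), 2):
        sign = (p_i - p_j) * (l_i - l_j)
        if sign > 0:      # both differences have the same sign: concordant
            n_c += 1
        elif sign < 0:    # differences have opposite signs: discordant
            n_d += 1
        # sign == 0 means a tie (a non concordant pair); it is not counted
    return n_c, n_d

def kendall_tau(probs, mags):
    """Kendall's tau as (n_c - n_d) / n_0 with n_0 = V(V-1)/2."""
    n_c, n_d = pair_counts(probs, mags)
    v = len(probs)
    return (n_c - n_d) / (v * (v - 1) / 2)

probs = [0.4, 0.3, 0.2, 0.1]
tau_perfect = kendall_tau(probs, [1, 2, 3, 4])  # only discordant pairs
counts_mixed = pair_counts(probs, [1, 3, 2, 4])  # one concordant pair
```

In the first example every pair is discordant, so $\tau = -1$, i.e. perfect agreement with the law of abbreviation.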

We will investigate the consequences of choosing a pair of indices at
random, $i$ and $j$ ($i \neq j$), and swapping either $p_i$ and $p_j$ or $l_i$ and $l_j$. A
prime will be used to indicate the value of a quantity or measure after the
swap. For instance, $\tau'$ and $l_i'$ indicate, respectively, the value of $\tau$ and that
of $l_i$ after the swap. $\Delta_\tau = \tau' - \tau$ and $\Delta_\Lambda = \Lambda' - \Lambda$ indicate the discrete
derivative of $\tau$ and $\Lambda$, respectively. For instance, $\Delta_\tau < 0$ indicates that the
concordance with the law of abbreviation increases after one swap. If the $p_i$’s are
real probabilities (not frequencies from a sample) and the $l_i$’s are durations
[27, 13] (not discrete lengths), ties are unlikely. For this reason we assume
that there are no ties (and therefore non concordant pairs are missing) to
investigate the relationship between $\Lambda$ and $\tau$. This has the further advantage
of simplifying the mathematical arguments. A careful analysis shows that
(see supplementary online information for further details)

• if the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant, then $\Delta_\Lambda, \Delta_\tau < 0$.

• if the pair $(p_i, l_i)$ and $(p_j, l_j)$ is discordant, then $\Delta_\Lambda, \Delta_\tau > 0$.

The results above can be summarised as

$$\Delta_\Lambda \Delta_\tau > 0. \tag{7}$$

This result means that if one of the two changes (e.g., $\Lambda$), the other (e.g., $\tau$)
also changes in the same direction (and vice versa).
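
The effect of a swap can be checked numerically. The sketch below, under the assumption of no ties and with a hypothetical strictly increasing cost $g(l) = \sqrt{l}$, swaps the magnitudes of every pair of types in turn and records the change in mean cost and in Kendall’s $\tau$. All names and data are illustrative and are not taken from the original analysis.

```python
import math
from itertools import combinations

def mean_cost(probs, mags, g):
    """Mean energetic cost sum_i p_i * g(l_i)."""
    return sum(p * g(l) for p, l in zip(probs, mags))

def kendall_tau(probs, mags):
    """Kendall's tau computed from concordant and discordant pair counts."""
    n_c = n_d = 0
    for (p_i, l_i), (p_j, l_j) in combinations(list(zip(probs, mags)), 2):
        sign = (p_i - p_j) * (l_i - l_j)
        if sign > 0:
            n_c += 1
        elif sign < 0:
            n_d += 1
    v = len(probs)
    return (n_c - n_d) / (v * (v - 1) / 2)

def swap_effects(probs, mags, g):
    """For each pair (i, j), swap l_i and l_j and report the changes.

    Returns a list of (was_concordant, delta_cost, delta_tau) tuples.
    """
    base_cost = mean_cost(probs, mags, g)
    base_tau = kendall_tau(probs, mags)
    effects = []
    for i, j in combinations(range(len(probs)), 2):
        swapped = list(mags)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        concordant = (probs[i] - probs[j]) * (mags[i] - mags[j]) > 0
        effects.append((concordant,
                        mean_cost(probs, swapped, g) - base_cost,
                        kendall_tau(probs, swapped) - base_tau))
    return effects

# Hypothetical repertoire without ties in probabilities or magnitudes.
probs = [0.35, 0.25, 0.20, 0.12, 0.08]
mags = [0.9, 0.4, 1.7, 1.1, 2.3]
effects = swap_effects(probs, mags, math.sqrt)
```

For this example, every swap of a concordant pair lowers both quantities and every swap of a discordant pair raises both, so the product of the two changes is always positive.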

It is easy to show that minimum $\Lambda$ implies $\tau \leq 0$ (even if there are ties
of probability or magnitude). The fact that removing a concordant pair (by
swapping) always decreases $\Lambda$ allows us to conclude that $n_c = 0$ is a necessary
condition for minimum $\Lambda$. Applying $n_c = 0$ to Eq. 6 and knowing that $n_d \geq 0$
and $n_0 > 0$, one concludes that $\tau \leq 0$ with equality if and only if
$n_d = 0$. The latter condition can be refined by taking into account the optimal
coding with non-singular codes with an alphabet of size $N$ (supplementary
online information) and assuming that there are no probability ties. In that
case, $n_d = 0$ is equivalent to $V \leq N$. These arguments strengthen and
generalise previous results on the need of the law of abbreviation in case
of optimal coding assuming that (1) $g$ is the identity function and that (2)
swaps are applied to types that are consecutive in an ordering of types by
decreasing probability [26].

Let us come back to the general framework of standard information
theory under the scheme of non-singular codes, namely, two different types
cannot be assigned the same string. In that case, optimal coding (minimum $\Lambda$
when the $p_i$’s are given) can be achieved with the following procedure:

1. Generate the sequence of the $V$ shortest strings that can be produced
with an alphabet of size $N$. This gives a sequence of strings from
length $l = 1$ to length $l_{max}$ (for length $l = l_{max}$ it might be necessary
to choose some of the strings of length $l_{max}$ arbitrarily).

2. Sort the sequence by increasing length (and within each length following
lexicographic order). In case of a binary alphabet ($N = 2$) and
$V = 10$, this yields the sequence 0, 1, 00, 01, 10, 11, 000, 001, 010, 011.

3. Assign the $i$-th string of the sequence above to the $i$-th most probable
type (in case of a tie in $p_i$’s, sort the types involved arbitrarily).

It is easy to see that the coding procedure minimises $\Lambda$. First, notice that
it satisfies the requirement that $n_c = 0$. Second, notice that it involves the
shortest strings possible and is therefore optimal (a detailed proof is available
in the supplementary online information).
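
The three-step procedure above can be sketched in Python as follows. The function names are hypothetical, the alphabet is passed as a string of symbols, and probability ties are broken by the original order of the types, which is one arbitrary choice among those the procedure allows.

```python
from itertools import count, islice, product

def shortest_strings(alphabet, v):
    """Steps 1 and 2: the v shortest strings over the alphabet, sorted by
    increasing length and, within each length, in lexicographic order."""
    def generate():
        for length in count(1):
            for symbols in product(alphabet, repeat=length):
                yield "".join(symbols)
    return list(islice(generate(), v))

def optimal_nonsingular_code(probs, alphabet):
    """Step 3: assign the i-th string to the i-th most probable type.

    Returns a list whose k-th element is the string assigned to type k.
    """
    ranking = sorted(range(len(probs)), key=lambda k: -probs[k])
    strings = shortest_strings(alphabet, len(probs))
    code = [None] * len(probs)
    for rank, k in enumerate(ranking):
        code[k] = strings[rank]
    return code

binary_sequence = shortest_strings("01", 10)
example_code = optimal_nonsingular_code([0.1, 0.6, 0.3], "01")
```

For a binary alphabet and ten types, `binary_sequence` reproduces the sequence listed in step 2, and in `example_code` the most probable type receives the first string.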

If $p_i$ is the probability of the $i$-th most probable type, the optimal
procedure above yields (see supplementary online information)

$$l_i = \begin{cases} \left\lceil \log_N \left( \frac{N-1}{N} i + 1 \right) \right\rceil & \text{for } N > 1 \\ i & \text{for } N = 1. \end{cases} \tag{8}$$

When $N$ and $i$ are sufficiently large,

$$l_i \approx \log_N i. \tag{9}$$

Notice that Eq. 8 relates the rank of a type (according to its probability)
with its length for optimal non-singular codes. Interestingly, that equation
can be regarded as an analogue of Eq. 2, which relates length and
probability for optimal uniquely decipherable codes. Both Eq. 2 and Eq. 8 mean
that length tends to grow as probability rank increases and thus lead to a
negative correlation between probability and magnitude. Therefore, a
negative correlation between probability and magnitude is expected both under
multiset constancy and also when there is freedom to assign any magnitude.
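
The relationship between rank and length for optimal non-singular codes can be checked against a direct enumeration. The sketch below assumes that the length of the $i$-th most probable type is $\lceil \log_N((N-1)i/N + 1) \rceil$ for $N > 1$ and $i$ for $N = 1$. To avoid floating-point rounding at the boundaries between lengths, the ceiling is evaluated with integer arithmetic as the smallest $l$ such that $N^{l+1} \geq (N-1)i + N$. Function names are hypothetical.

```python
def length_of_rank(i, n):
    """Length of the i-th most probable type (i >= 1), alphabet size n."""
    if n == 1:
        return i
    # Smallest l with n**l >= (n-1)*i/n + 1, i.e. n**(l+1) >= (n-1)*i + n.
    target = (n - 1) * i + n
    l = 1
    while n ** (l + 1) < target:
        l += 1
    return l

def lengths_by_enumeration(v, n):
    """Lengths of the v shortest strings: n strings of length 1,
    n**2 strings of length 2, and so on."""
    lengths = []
    l = 1
    while len(lengths) < v:
        lengths.extend([l] * min(n ** l, v - len(lengths)))
        l += 1
    return lengths

binary_lengths = [length_of_rank(i, 2) for i in range(1, 11)]
```

For a binary alphabet, the first ten lengths are 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, matching the lengths of the strings 0, 1, 00, 01, 10, 11, 000, 001, 010, 011.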

Bear in mind that removing all concordant pairs can lead to a drastic
reduction of $\Lambda$ but does not guarantee that the coding is optimal in an
information-theoretic sense. Suppose that magnitudes are string lengths (in
symbols), and that there are neither probability ties nor length ties, i.e.
$n_c + n_d = n_0$. After removing all concordant pairs by swapping we get
$n_c = 0$, and thus Eq. 6 gives $\tau = -1$, the strongest negative correlation
possible. However, optimal coding with discrete units implies length ties for
$V > 2$ (supplementary online information).


2.2 The relationship between $\Lambda$ and $\rho$

Indirect relationships between $\rho$ and $\Lambda$ follow from those between $\rho$ and $\tau$.
On the one hand, $\rho$ satisfies

$$\frac{1}{2}(3\tau - 1) \leq \rho \leq \frac{1}{2}(1 + 3\tau). \tag{10}$$

On the other hand, $\rho$ satisfies

$$\frac{(1 + \tau)^2}{2} - 1 \leq \rho \leq 1 - \frac{(1 - \tau)^2}{2}, \tag{11}$$

as illustrated in Fig. 2. See [?] for further relationships between $\rho$ and $\tau$.
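
The interval of values of Spearman’s $\rho$ that is compatible with a given Kendall’s $\tau$ can be computed directly from the two pairs of bounds. The sketch below assumes that the bounds read $(3\tau - 1)/2 \leq \rho \leq (1 + 3\tau)/2$ and $(1 + \tau)^2/2 - 1 \leq \rho \leq 1 - (1 - \tau)^2/2$; the function name is hypothetical.

```python
def rho_bounds(tau):
    """Return (lower, upper) bounds on Spearman's rho for a given tau,
    combining the two pairs of inequalities assumed in the lead-in."""
    lower = max(0.5 * (3 * tau - 1), (1 + tau) ** 2 / 2 - 1)
    upper = min(0.5 * (1 + 3 * tau), 1 - (1 - tau) ** 2 / 2)
    return lower, upper

bounds_at_zero = rho_bounds(0.0)
bounds_at_minus_one = rho_bounds(-1.0)
bounds_at_half = rho_bounds(0.5)
```

Under these assumptions, $\tau = -1$ forces $\rho = -1$, so a perfect agreement with the law of abbreviation in terms of $\tau$ carries over to $\rho$.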


2.3 The relationship between $\Lambda$ and Pearson’s $r$

The Pearson correlation between the probability of a unit ($p$) and its
energetic cost ($\lambda$) is

$$r = \frac{E[p\lambda] - E[p]E[\lambda]}{\sigma[p]\sigma[\lambda]}, \tag{12}$$

where $E[x]$ and $\sigma[x]$ are, respectively, the expectation and the standard
deviation of a random variable $x$. $E[x]$ can be regarded as the average value
of $x$ obtained when drawing types uniformly at random from the repertoire.
Knowing

$$E[p\lambda] = \frac{1}{V} \sum_{i=1}^{V} p_i \lambda_i = \frac{\Lambda}{V} \tag{13}$$

and

$$E[p] = \frac{1}{V} \sum_{i=1}^{V} p_i = \frac{1}{V}, \tag{14}$$

Eq. 12 can be expressed as

$$r = \frac{\Lambda - E[\lambda]}{V \sigma[p] \sigma[\lambda]}. \tag{15}$$

Therefore, $r$ is a function of $\Lambda$.

Note that in quantitative linguistics, the term type is used to refer to a
string of symbols and the term token is used to refer to an occurrence of
a type [38]. The term type is used with the same meaning in quantitative
studies of animal behaviour [39] or in Mandelbrot’s pioneering work [?].
$E[\lambda]$ and $\sigma[\lambda]$ are, respectively, the mean cost and the standard deviation of
the cost of the types.

Now, let us assume a constancy condition: $V$, $E[\lambda]$, $\sigma[p]$ and $\sigma[\lambda]$ are
constant (see the supplementary online information for a justification). If
that simplifying condition holds, Eq. 15 indicates that the minimisation
of $\Lambda$ is equivalent to the minimisation of $r$, which in turn maximises the
concordance with the law of abbreviation because $g(l)$ is a monotonically
increasing function of $l$. The same conclusion can be reached with a simpler
but less general constancy condition, namely, multiset constancy.

$r < 0$ can be regarded as concordance with the law of abbreviation. To
know if $r < 0$, it is not necessary to actually calculate $r$ with Eq. 15. On the
right hand side of this equation, the denominator is positive (since $V$ and
the standard deviations are positive). Hence, the sign of $r$ is determined by
the sign of the numerator. Therefore, $r < 0$ if and only if

$$\Lambda = \sum_{i=1}^{V} p_i \lambda_i < E[\lambda] = \frac{1}{V} \sum_{i=1}^{V} \lambda_i, \tag{16}$$

i.e. the expected energetic cost of types when selecting them according to
their probability ($p$) is smaller than the expected energetic cost of types
when picking them uniformly at random from the repertoire. Eq. 16 tells us that a
negative sign of $r$ is equivalent to a mean energetic cost of tokens that does
not exceed the mean energetic cost of types.
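
The link between Pearson’s $r$ and the mean cost can be verified numerically. The sketch below computes $r$ in two ways: from its definition in terms of expectations over types drawn uniformly at random, and as $(\Lambda - E[\lambda]) / (V \sigma[p] \sigma[\lambda])$. It assumes that the probabilities add up to one and that standard deviations are population standard deviations (dividing by $V$); names and data are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    """Population standard deviation (divides by the number of types)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def pearson_direct(probs, costs):
    """r = (E[p*lambda] - E[p]E[lambda]) / (sigma[p] sigma[lambda])."""
    e_prod = mean([p * c for p, c in zip(probs, costs)])
    return (e_prod - mean(probs) * mean(costs)) / (std(probs) * std(costs))

def pearson_from_cost(probs, costs):
    """r = (Lambda - E[lambda]) / (V sigma[p] sigma[lambda])."""
    v = len(probs)
    big_lambda = sum(p * c for p, c in zip(probs, costs))
    return (big_lambda - mean(costs)) / (v * std(probs) * std(costs))

# Hypothetical repertoire in which frequent types are cheaper.
probs = [0.5, 0.3, 0.15, 0.05]
costs = [1.0, 2.0, 2.5, 4.0]
r_direct = pearson_direct(probs, costs)
r_cost = pearson_from_cost(probs, costs)
big_lambda = sum(p * c for p, c in zip(probs, costs))
```

In this example the mean cost of tokens (`big_lambda`) is below the mean cost of types (`mean(costs)`), and accordingly both expressions give the same negative correlation.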
A limitation of the connection between $r$ and $\Lambda$ above is not only the
validity of the constancy condition but also that $r$ is a measure of linear
association. A priori, we do not know if the relationship between $p$ and $\lambda$ is
linear. For this reason it is vital to explore a connection between $\Lambda$ and
measures of correlation that can capture non-linear dependencies.


3 Discussion 


Assuming some constancy conditions on probabilities and magnitudes, we
have shown an intimate relationship between the minimisation of $\Lambda$ and the
minimisation of various measures of correlation between the probability of
a type and its magnitude. Notice that minimisation of the correlation is
equivalent to the maximisation of the concordance with the law of
abbreviation. This potentially explains the ubiquity of a generalised version of Zipf’s
law of abbreviation in human language and also in the behaviour of other
species.

More specifically, we have shown that Pearson’s $r$ contains $\Lambda$ in its
definition (Eq. 12). A straightforward relationship between a function to optimise
and correlation is also found in methods for community detection in networks
(see supplementary online information for further details). Our
mathematical results further shed light on previous results on the law of abbreviation
involving $r$.

First, $r$ has been used to investigate a generalised version of Zipf’s law
of abbreviation in dolphin surface behavioural patterns [12] and the
vocalisations of Formosan macaques [13]. The magnitude of dolphin surface
behavioural patterns was measured in elementary behavioural units while
the magnitude of Formosan macaque vocalisations was measured by their
duration. The mathematical connections presented in Section 2.3 predict a
significantly low mean cost $\Lambda$ for Formosan macaques [13] and dolphins via
the significant negative $r$ found in both species [13, 12]. Only two
assumptions are required: multiset constancy for both the correlation test and the
test of significance of $\Lambda$, and that the energetic cost of a signal is
proportional to its magnitude. This prediction is confirmed by the analysis of the
significance of $\Lambda$ in dolphin surface behavioural patterns and the vocalisations
of Formosan macaques [26].

Second, the same arguments predict a significant negative Pearson
correlation between frequency and magnitude from the significantly low $\Lambda$ that
has been found in various languages [26], although that correlation was not
investigated for them.

In spite of the predictive power of the minimisation of $\Lambda$, we do not mean
that the law of abbreviation is inevitable. Exceptions are known in other
species [13, ?]. This is not surprising from the perspective of information
theory. Solving the problem of compression is in conflict with the problem of
transmission: redundancy must be added in a controlled fashion to combat
noise in the channel [25, p. 184]. Consistently, the law of abbreviation
prevails in short range communication [?].

3.1 Compression versus random typing 

Simple mechanisms such as random typing [?] can reproduce
the law of abbreviation [45]. Random typing models produce "words" by
concatenating units, one of them behaving as "word" delimiter. Some
researchers regard random typing as not involving any optimisation at all
[?]. However, it is conceivable that the dynamical rules of random typing
arise or are reinforced or stabilised by compression, given

• The equivalence between the law of abbreviation and compression
outlined above.

• The optimality of the nonsingular coding scheme in the definition of
random typing models (see supplementary online information for
further details).

In that case, random typing could be seen as a special manifestation of
compression. Another connection between optimisation and random typing
is that stringing subunits to form "words" as in random typing is a convenient
strategy to combat noise in communication [18].

Having said this, it is unlikely that random typing is the mediator
between compression and the law of abbreviation in human languages. A
serious limitation of random typing models is that the probability of a word
is totally determined by its composition. In simple versions of the model
[?], the probability of a word is determined by its length (the
characters constituting the words are irrelevant), i.e. the length of a word ($l$) is
a decreasing linear function of the logarithm of its probability ($p$)

$$l = a \log p + b, \tag{17}$$

where $a$ and $b$ are constants ($a < 0$). To see it, notice that the probability
of a "word" $w$ is [?, p. 838]

$$p(w) = \left( \frac{1 - p_s}{N} \right)^{l} \frac{p_s}{(1 - p_s)^{l_0}}, \tag{18}$$

where $l$ is the length of $w$, $p_s$ is the probability of producing the word
delimiter, $N$ is the size of the alphabet that the words consist of ($N > 1$) and
$l_0$ is the minimum word length ($l_0 > 0$). Eq. 18 allows one to express $l$ as a
function of $p(w)$. Rearranging the terms of Eq. 18, taking logarithms, and
replacing $p(w)$ by $p$, one recovers Eq. 17 with

$$a = \left( \log \frac{1 - p_s}{N} \right)^{-1} \tag{19}$$

and

$$b = a \log \frac{(1 - p_s)^{l_0}}{p_s}. \tag{20}$$
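
The linear relationship between word length and log-probability in simple random typing can be checked numerically. The sketch below rests on an assumption about the form of the word probability, namely $p(w) = ((1 - p_s)/N)^l \, p_s / (1 - p_s)^{l_0}$, with slope $a = 1 / \log((1 - p_s)/N)$ and intercept $b = a \log((1 - p_s)^{l_0} / p_s)$. Function names and parameter values are hypothetical.

```python
import math

def word_probability(l, p_s, n, l_0):
    """Probability of one particular "word" of length l (assumed form)."""
    return ((1 - p_s) / n) ** l * p_s / (1 - p_s) ** l_0

def line_coefficients(p_s, n, l_0):
    """Slope a and intercept b of l = a * log(p) + b."""
    a = 1 / math.log((1 - p_s) / n)
    b = a * math.log((1 - p_s) ** l_0 / p_s)
    return a, b

# Hypothetical parameters: 26 letters, delimiter probability 0.2,
# and words with at least one character.
p_s, n, l_0 = 0.2, 26, 1
a, b = line_coefficients(p_s, n, l_0)
recovered = [a * math.log(word_probability(l, p_s, n, l_0)) + b
             for l in range(l_0, 10)]
```

Under these assumptions, `recovered` reproduces the lengths 1, 2, ..., 9 up to rounding error, and the slope `a` is negative, so longer "words" are less probable.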


Another limitation of random typing is that it is not a plausible model for
human language from a psychological perspective [16] and also from a social
perspective: the "words" produced by random typing are not constrained
by a predetermined vocabulary of words whose meanings have been agreed
upon by social interaction among individuals as in human language [?].
Also, comparing the statistical properties of random typing against real
languages reveals striking differences:

• The distribution of "word" frequencies deviates from that of actual
word frequencies significantly [16].

• While random typing yields a geometric distribution of "word" lengths,
the actual distribution of word lengths is not a monotonically
decreasing function of $l$ [?].

• In the simple versions of the model, words of the same length are
equally likely, a property that real languages do not satisfy [50, 51].

Another challenge for random typing is homophones: words that have the
same composition (i.e. the same sequence of phonemes, and thus, the same
length in phonemes) but can have different frequencies. Interestingly, given
a pair of homophones, the more frequent one tends to have a shorter
duration in time [?]. This is impossible in random typing but explainable by
the minimisation of $\Lambda$. However, random typing might be relevant for the
finding of the law of abbreviation in bird song [?], where the need for social
consensus about the meaning of a song is missing or secondary while pressure
for song diversity is crucial to maximise the chances of mating [52].
3.2 From Zipf to standard information theory

At a more general level, our contributions can be interpreted in two
directions. First, we have made a step forward in formalising Zipf’s view (e.g.,
Zipf’s minimum equation) [1] with information-theoretic rigour. Second, we
have contributed to expanding standard information theory beyond uniquely
decipherable codes. We have presented an optimal coding procedure under
the broader class of non-singular codes, and shown that the minimisation of
cost leads to a negative correlation between the probability of a type and
its magnitude (a generalised law of abbreviation) under wide conditions.
These results are crucial for research into natural communication systems.
While standard information theory is focused on uniquely decipherable codes
[25, ?], real languages do not fit that scheme: given a string of
letters, there might be more than one way of breaking it into words. We hope
that our work on optimal coding helps to change the view that information
theory does not contribute to understanding natural language problems [24].
Further linguistic applications of our theoretical framework beyond the law
of abbreviation are presented in the supplementary online information.

3.3 Causality 

A challenge for our theoretical arguments is the extent to which the
minimisation of $\Lambda$ is a causal force for the emergence of Zipf’s law of abbreviation.
The apparent universality of the law of abbreviation in languages and the
multiple theoretical connections between compression and the law suggest
that the minimisation of $\Lambda$ is indeed a causal force. Additional support for
compression as cause may come from the investigation of its predictions in
grammaticalisation, a process of language change by which words become
progressively more frequent and shorter [56, ?]. In the worst case (i.e.
compression is not the driving force), compression would still illuminate the
optimality of the law of abbreviation and its stability once it was reached
through a mechanism unrelated to compression.
Figure 1: The law of abbreviation in human languages and Formosan
macaques. (a) The relationship between word frequency and word length
(in characters) for the English and Estonian translations of the Universal
Declaration of Human Rights (http://www.unicode.org/udhr/). The Estonian
words chosen are ja (and), on (the), õigus (right), artikkel (article) and
kriminaalsüüdistuste (criminal prosecutions). (b) The relationship between call
type frequency and call type mean duration (in ms) in Formosan macaques (data
borrowed from [?]).

Figure 2: The gray region covers the values of Spearman $\rho$ satisfying Eq.
11 as a function of Kendall $\tau$.
SUPPLEMENTARY ONLINE INFORMATION


A Optimal coding


We have put forward a cost function $\Lambda$, i.e.

$$\Lambda = \sum_{i=1}^{V} p_i \lambda_i, \tag{21}$$

where $p_i$ and $\lambda_i$ are, respectively, the probability and the energetic cost
of the $i$-th type. We have assumed that the energetic cost of a type is a
monotonically increasing function of its magnitude $l_i$, i.e. $\lambda_i = g(l_i)$. When
$g(l_i) = l_i$ and $l_i$ is the length in symbols of the alphabet, $\Lambda$ becomes $L$, the
mean code length of standard information theory [25].

Here we investigate the minimisation of $\Lambda$ when the $p_i$’s are given.

A.1 Optimal coding without any constraint

The solution to the minimisation of $\Lambda$ when no further constraint is imposed
is that all types have minimum magnitude, i.e.

$$l_i = l_{min} \text{ for } i = 1, 2, ..., V. \tag{22}$$

If $l_i$ is the length of the $i$-th type with $l_i \geq 1$, then $l_{min} = 1$. If confusion
between types has to be avoided, the unconstrained minimisation of $\Lambda$ implies
that $N$, the size of the alphabet used to build strings of symbols, cannot be
smaller than $V$.

A.2 Optimal coding with nonsingular codes

Standard information theory rests on the elementary assumption that
different types cannot be represented by the same string of symbols [25]. Under
the wide scheme of uniquely decipherable codes, standard information theory
tells us that the minimisation of $L$ leads to [25]

$$l_i \propto \lceil -\log p_i \rceil, \tag{23}$$

which is indeed a particular case of Zipf’s law of abbreviation.

Here we investigate the optimal coding using nonsingular codes, a
superset of uniquely decipherable codes [25, p. 106]. The function to minimise
is $\Lambda$, a generalisation of $L$, the mean code length of standard coding theory
[?].
We consider the set of all the strings of symbols that can be built with
an alphabet of size $N$. Suppose that we sort the strings by increasing length
(the relative ordering of strings of the same length is arbitrary), thus the
strings in positions 1 to $N$ have length 1, the strings in positions $N + 1$ to
$N + N^2$ have length 2, and so on. Suppose that the types to be coded are sorted
by decreasing probability, i.e.

$$p_1 \geq p_2 \geq ... \geq p_V, \tag{24}$$

$p_i$ being the probability of the $i$-th type. Suppose that we assign the
$i$-th string to the $i$-th type for $i = 1, 2, ..., V$. An example of this coding
are random typing models [43, ?]. We will show that this coding
method, which we refer to as method A, is optimal. We will proceed in two
steps. First, recall that minimum $\Lambda$ requires $n_c = 0$, where $n_c$ is the number
of concordant pairs (as explained in the main article). Recall also that the
pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant if and only if $p_i < p_j$ and $l_i < l_j$ or
$p_i > p_j$ and $l_i > l_j$. Notice that, by definition, method A produces no
concordant pairs. Second, suppose that

$$\Lambda^x = \sum_{i=1}^{V} p_i \lambda_i^x, \tag{25}$$

where $\lambda_i^x$ is the cost of the $i$-th type according to some coding method $x$.
We will show by induction on $V$ that for any alternative method B that is
based on nonsingular codes with $n_c = 0$, $\Lambda^B \geq \Lambda^A$.


• Setup. Since method A only produces pairs that are either non concordant
or discordant and probabilities obey Eq. 24, we have

$$\lambda_1^A \leq ... \leq \lambda_i^A \leq ... \leq \lambda_V^A. \tag{26}$$

Suppose that there are no probability ties. Then

$$\lambda_1^B \leq ... \leq \lambda_i^B \leq ... \leq \lambda_V^B \tag{27}$$

follows immediately. In case of probability ties, Eq. 27 may not hold.
Suppose $V = 3$, $p_1 = p_2 = p_3$ and $l_1^B = 2$, $l_2^B = l_3^B = 1$. This coding
lacks concordant pairs but does not satisfy Eq. 27. However, any
coding produced by method B can be converted into one that satisfies
Eq. 27 with the same $\Lambda$ by sorting all magnitudes increasingly in every
probability tie. We assume that the codings produced by method B
have been rearranged in this fashion. This is crucial for the inductive
step.
• Basis: $V = 1$. Then

$$\Lambda^A = p_1 \lambda_1^A = p_1 g(1) \tag{28}$$

$$\Lambda^B = p_1 \lambda_1^B = p_1 g(l_1). \tag{29}$$

Recalling that $g$ is a strictly monotonically increasing function, Eqs.
28 and 29 indicate that the condition $\Lambda^B \geq \Lambda^A$ is equivalent to $l_1 \geq 1$,
which is trivially true by the definition of $l_i$.

• Inductive hypothesis. If Eq. 27 holds then

$$\sum_{i=1}^{V} p_i \lambda_i^B \geq \sum_{i=1}^{V} p_i \lambda_i^A. \tag{30}$$


• Inductive step. We want to show that

$$\sum_{i=1}^{V+1} p_i \lambda_i^B \geq \sum_{i=1}^{V+1} p_i \lambda_i^A \tag{31}$$

when

$$\lambda_1^B \leq ... \leq \lambda_i^B \leq ... \leq \lambda_{V+1}^B. \tag{32}$$

Eq. 31 is equivalent to

$$S + p_{V+1} (\lambda_{V+1}^B - \lambda_{V+1}^A) \geq 0, \tag{33}$$

with

$$S = \sum_{i=1}^{V} p_i (\lambda_i^B - \lambda_i^A). \tag{34}$$

Let us define $l_i^x$ as the magnitude of the $i$-th type according to some
coding method $x$. To show that Eq. 33 holds, it suffices to show that

$$l_{V+1}^B \geq l_{V+1}^A \tag{35}$$

because $S \geq 0$ by the induction hypothesis (notice that if Eq. 32 holds
then Eq. 27 also holds), $p_{V+1}$ is positive by definition and $\lambda_i^x = g(l_i^x)$,
where $g$ is a strictly monotonically increasing function. Notice that if

$$l_{V+1}^B < l_{V+1}^A \tag{36}$$

then method B would not employ nonsingular codes. To see it, notice
that $l_{V+1}^A$ is the smallest integer such that

$$V < \sum_{l=1}^{l_{V+1}^A} N^l, \tag{37}$$

where $N^l$ is the number of strings of length $l$ that can be produced. Thus,
method B must assign the same string to different types when Eq. 36
holds.


We aim to derive the relationship between the rank of a type (defined according to its probability) and its length in case of optimal non-singular codes for $N > 1$. Suppose that $p_i$ is the probability of the $i$-th most probable type and that $l_i$ is its length. The largest rank of types of length $l$ is

$$i = \sum_{k=1}^{l} N^k. \tag{38}$$

When $N > 1$, we get

$$i = \frac{N(N^l - 1)}{N - 1} \tag{39}$$

and equivalently

$$N^l = \frac{N - 1}{N} i + 1. \tag{40}$$

Taking logs on both sides of the equality, one obtains

$$l = \frac{\log\left(\frac{N-1}{N} i + 1\right)}{\log N}. \tag{41}$$

The result can be generalised to any rank of types of length $l$ as

$$l = \left\lceil \frac{\log\left(\frac{N-1}{N} i + 1\right)}{\log N} \right\rceil. \tag{42}$$

Changing the base of the logarithm to $N$, one obtains

$$l = \left\lceil \log_N\left(\frac{N-1}{N} i + 1\right) \right\rceil. \tag{43}$$

The same conclusion has been reached [58] but lacking a detailed explanation like ours. The case $N = 1$ is trivial: one has $l = i$. Therefore, we conclude that the optimal coding with non-singular codes yields that the length of the $i$-th most probable type is

$$l_i = \begin{cases} \left\lceil \log_N\left(\frac{N-1}{N} i + 1\right) \right\rceil & \text{for } N > 1 \\ i & \text{for } N = 1. \end{cases} \tag{44}$$
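As a hedged illustration of Eq. 44, the following sketch assigns lengths by filling up all strings of length 1, then 2, and so on, which is the counting argument behind Eqs. 38 to 43. It works with integers only, so it avoids rounding problems that a floating-point logarithm can have at the boundary ranks. The function name `optimal_length` is ours and is not taken from the article.

```python
def optimal_length(i: int, N: int) -> int:
    """Length of the i-th most probable type (i >= 1) under optimal
    nonsingular coding over an alphabet of N symbols."""
    if i < 1 or N < 1:
        raise ValueError("i and N must be positive integers")
    if N == 1:
        # Unary alphabet: there is exactly one string of each length.
        return i
    length, largest_rank = 0, 0
    # largest_rank is the sum of N**k for k = 1..length (Eq. 38).
    while largest_rank < i:
        length += 1
        largest_rank += N ** length
    return length


# Lengths of the seven most probable types for a binary alphabet.
lengths_binary = [optimal_length(i, 2) for i in range(1, 8)]
```

For a binary alphabet the first two ranks receive length 1 and the next four receive length 2, as expected from the number of available strings of each length.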
B The relationship between $\Lambda$ and $\tau$


Here we investigate the consequences of swapping $p_i$ and $p_j$ or $l_i$ and $l_j$ on $\Delta_\Lambda$ and $\Delta_\tau$, the discrete derivative of $\Lambda$ and $\tau$, respectively. For mathematical simplicity, we assume that there are no ties. Then $i$ and $j$ define a concordant pair or a discordant pair. Section B.1 presents a result on the discrete derivative of $n_c$ which is crucial to conclude in Section B.2.2 that

• If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant, $\Delta_\tau < 0$.

• If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is discordant, $\Delta_\tau > 0$.

Moreover, Section B.2.3 also shows that

• If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant, $\Delta_\Lambda < 0$.

• If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is discordant, $\Delta_\Lambda > 0$.

B.1 The discrete derivative of the number of concordant pairs

Hereafter we keep $i$ and $j$ for the subindices of the pairs being swapped and use $x$ and $y$ for the subindices of pairs in general. $n_c$, the number of concordant pairs, can be defined as a summation over all pairs, i.e.

$$n_c = \sum_{x=1}^{V} \sum_{y=1}^{V} c_{xy} \tag{45}$$

where $c_{xy}$ is an indicator variable: $c_{xy} = 1$ if $p_x < p_y$ and $l_x < l_y$ for the $x$-th and the $y$-th type; $c_{xy} = 0$ otherwise. Thus, the pair $(p_x, l_x)$ and $(p_y, l_y)$ is concordant if and only if $c_{xy} = 1$ or $c_{yx} = 1$. $c_{xy}$ can be expressed as a product of indicator variables, i.e.

$$c_{xy} = a_{xy} b_{xy}, \tag{46}$$

where $a_{xy}$ indicates if $p_x < p_y$ and $b_{xy}$ indicates if $l_x < l_y$.

The assumption that there are no ties gives a couple of valuable properties:

• If $x = y$, $a_{xy} = b_{xy} = 0$.

• If $x \neq y$,

$$a_{xy} = 1 - a_{yx} \tag{47}$$

$$b_{xy} = 1 - b_{yx} \tag{48}$$

by symmetry.

We define $n_c'$ as the number of concordant pairs after the swap. $A'$ and $B'$ are the state of matrices $A = \{a_{xy}\}$ and $B = \{b_{xy}\}$ after the swap.
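The definition of the number of concordant pairs through indicator matrices can be sketched as follows. The example values of `p` and `l` are hypothetical, and `concordant_pairs_direct` is an independent count over unordered pairs that we add only as a cross-check.

```python
from itertools import combinations


def concordant_pairs(p, l):
    """n_c as the double sum of c_xy = a_xy * b_xy, with a_xy = [p_x < p_y]
    and b_xy = [l_x < l_y]; assumes there are no ties."""
    V = len(p)
    a = [[int(p[x] < p[y]) for y in range(V)] for x in range(V)]
    b = [[int(l[x] < l[y]) for y in range(V)] for x in range(V)]
    return sum(a[x][y] * b[x][y] for x in range(V) for y in range(V))


def concordant_pairs_direct(p, l):
    """Cross-check: count unordered pairs ordered the same way by p and l."""
    return sum((p[x] - p[y]) * (l[x] - l[y]) > 0
               for x, y in combinations(range(len(p)), 2))


p = [0.4, 0.3, 0.2, 0.1]   # hypothetical probabilities, no ties
l = [1, 3, 2, 4]           # hypothetical lengths, no ties
n_c = concordant_pairs(p, l)
```

Because ties are excluded, each concordant pair contributes exactly one of $c_{xy}$ or $c_{yx}$, so the double sum counts every unordered pair once.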


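B.2 $l_i$ and $l_j$ are swapped

If only $l_i$ and $l_j$ are swapped, then $A' = A$, but $B' = B$ is not warranted. Interestingly, the changes in $B$ concern only the $i$-th and the $j$-th row and the $i$-th and the $j$-th column, i.e.

$$b'_{xy} = \begin{cases}
b_{ji} & \text{if } x \neq y \text{ and } x = i \text{ and } y = j \\
b_{ij} & \text{if } x \neq y \text{ and } x = j \text{ and } y = i \\
b_{jy} & \text{if } x \neq y \text{ and } x = i \text{ and } y \neq j \\
b_{iy} & \text{if } x \neq y \text{ and } x = j \text{ and } y \neq i \\
b_{xj} & \text{if } x \neq y \text{ and } x \neq j \text{ and } y = i \\
b_{xi} & \text{if } x \neq y \text{ and } x \neq i \text{ and } y = j \\
b_{xy} & \text{otherwise.}
\end{cases} \tag{49}$$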


Then, the 1st derivative of $n_c$ as a function of the number of swaps performed is

$$\Delta_{n_c} = n_c' - n_c. \tag{50}$$

Suppose that $\gamma$ is the sum of the values in the $i$-th and $j$-th row as well as in the $i$-th and $j$-th column of $C = \{c_{xy}\}$ (if a value is found in both a column and a row it will be summed only once) and $\gamma'$ is the value of $\gamma$ after the swap. Then the 1st derivative can also be defined as

$$\Delta_{n_c} = \gamma' - \gamma. \tag{51}$$

By definition,

$$\gamma = S_1 + S_2 + S_3 + S_4 - T \tag{52}$$

with

$$S_1 = \sum_{y=1}^{V} c_{iy} = \sum_{y=1}^{V} a_{iy} b_{iy} \tag{53}$$

$$S_2 = \sum_{y=1}^{V} c_{jy} = \sum_{y=1}^{V} a_{jy} b_{jy} \tag{54}$$

$$S_3 = \sum_{x=1}^{V} c_{xi} = \sum_{x=1}^{V} a_{xi} b_{xi} \tag{55}$$

$$S_4 = \sum_{x=1}^{V} c_{xj} = \sum_{x=1}^{V} a_{xj} b_{xj} \tag{56}$$

and

$$T = c_{ij} + c_{ji} + c_{ii} + c_{jj}. \tag{57}$$

$T$ is the sum of the values that have been summed twice by $S_1$, $S_2$, $S_3$ and $S_4$ in Eq. 52. Notice that $S_1 + S_2 + S_3 + S_4 = 2T$ when $V = 2$. Since $a_{xx} = b_{xx} = 0$,

$$T = a_{ij} b_{ij} + a_{ji} b_{ji}. \tag{58}$$

Recalling Eqs. 47 and 48, $S_3$ can be expressed as

$$S_3 = \sum_{y=1}^{V} (1 - a_{iy})(1 - b_{iy}) - (1 - a_{ii})(1 - b_{ii}) = \sum_{y=1}^{V} (1 - a_{iy})(1 - b_{iy}) - 1. \tag{59}$$

Similarly, $S_4$ can be expressed as

$$S_4 = \sum_{y=1}^{V} (1 - a_{jy})(1 - b_{jy}) - (1 - a_{jj})(1 - b_{jj}) = \sum_{y=1}^{V} (1 - a_{jy})(1 - b_{jy}) - 1. \tag{60}$$

On the one hand,

$$S_1 + S_3 = \sum_{y=1}^{V} a_{iy}(2 b_{iy} - 1) + V - \sum_{y=1}^{V} b_{iy} - 1. \tag{61}$$

On the other hand,

$$S_2 + S_4 = \sum_{y=1}^{V} a_{jy}(2 b_{jy} - 1) + V - \sum_{y=1}^{V} b_{jy} - 1. \tag{62}$$


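Thus, Eq. 52 becomes

$$\gamma = \sum_{y=1}^{V} \left[a_{iy}(2 b_{iy} - 1) + a_{jy}(2 b_{jy} - 1)\right] - \sum_{y=1}^{V} (b_{iy} + b_{jy}) + 2(V - 1) - a_{ij} b_{ij} - a_{ji} b_{ji}. \tag{63}$$

By definition,

$$\gamma' = S_1' + S_2' + S_3' + S_4' - T' \tag{64}$$

with

$$S_1' = \sum_{y=1}^{V} c'_{iy} = \sum_{y=1}^{V} a_{iy} b'_{iy} \tag{65}$$

$$S_2' = \sum_{y=1}^{V} c'_{jy} = \sum_{y=1}^{V} a_{jy} b'_{jy} \tag{66}$$

$$S_3' = \sum_{x=1}^{V} c'_{xi} = \sum_{x=1}^{V} a_{xi} b'_{xi} \tag{67}$$

$$S_4' = \sum_{x=1}^{V} c'_{xj} = \sum_{x=1}^{V} a_{xj} b'_{xj} \tag{68}$$

and

$$T' = c'_{ij} + c'_{ji} + c'_{ii} + c'_{jj} = a_{ij} b'_{ij} + a_{ji} b'_{ji} + a_{ii} b'_{ii} + a_{jj} b'_{jj}. \tag{69}$$

$T'$ is the sum of the values that have been summed twice by $S_1'$, $S_2'$, $S_3'$ and $S_4'$ in Eq. 64. Notice that $S_1' + S_2' + S_3' + S_4' = 2T'$ when $V = 2$. Applying Eq. 49, $T'$ becomes

$$\begin{aligned}
T' &= a_{ij} b_{ji} + a_{ji} b_{ij} + a_{ii} b_{ii} + a_{jj} b_{jj} \\
&= a_{ij} b_{ji} + a_{ji} b_{ij} && \text{(applying } a_{xx} = b_{xx} = 0\text{)} \\
&= a_{ij} + b_{ij} - 2 a_{ij} b_{ij}. && \text{(applying } a_{ji} = 1 - a_{ij} \text{ and } b_{ji} = 1 - b_{ij}\text{)}
\end{aligned} \tag{70}$$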

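Applying Eq. 49, $S_1'$ can be expressed as

$$\begin{aligned}
S_1' &= \sum_{y=1}^{V} a_{iy} b_{jy} - a_{ii} b_{ji} - a_{ij} b_{jj} + a_{ii} b_{ii} + a_{ij}(1 - b_{ij}) && \text{(applying } b_{ji} = 1 - b_{ij}\text{)} \\
&= \sum_{y=1}^{V} a_{iy} b_{jy} + a_{ij}(1 - b_{ij}) && \text{(applying } a_{xx} = b_{xx} = 0\text{)}
\end{aligned} \tag{71}$$

while $S_2'$ can be expressed as

$$\begin{aligned}
S_2' &= \sum_{y=1}^{V} a_{jy} b_{iy} - a_{ji} b_{ii} - a_{jj} b_{ij} + a_{ji} b_{ij} + a_{jj} b_{jj} \\
&= \sum_{y=1}^{V} a_{jy} b_{iy} + (1 - a_{ij}) b_{ij}.
\end{aligned} \tag{72}$$

Similar arguments give

$$\begin{aligned}
S_3' &= \sum_{x=1}^{V} a_{xi} b_{xj} - a_{ii} b_{ij} - a_{ji} b_{jj} + a_{ii} b_{ii} + a_{ji} b_{ij} \\
&= \sum_{x=1}^{V} a_{xi} b_{xj} + (1 - a_{ij}) b_{ij}
\end{aligned} \tag{73}$$

$$\begin{aligned}
S_4' &= \sum_{x=1}^{V} a_{xj} b_{xi} - a_{ij} b_{ii} - a_{jj} b_{ji} + a_{ij} b_{ji} + a_{jj} b_{jj} \\
&= \sum_{x=1}^{V} a_{xj} b_{xi} + a_{ij}(1 - b_{ij}).
\end{aligned} \tag{74}$$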


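Recalling Eqs. 47 and 48, $S_3'$ can be expressed as

$$S_3' = \sum_{y=1}^{V} (1 - a_{iy})(1 - b_{jy}) - (1 - a_{ii})(1 - b_{ji}) - (1 - a_{ij})(1 - b_{jj}) + (1 - a_{ij}) b_{ij} \tag{75}$$

$$\begin{aligned}
&= \sum_{y=1}^{V} (1 - a_{iy})(1 - b_{jy}) - b_{ij} - (1 - a_{ij}) + b_{ij} - a_{ij} b_{ij} \\
&= \sum_{y=1}^{V} (1 - a_{iy})(1 - b_{jy}) + a_{ij} - a_{ij} b_{ij} - 1.
\end{aligned} \tag{76}$$

Similarly, $S_4'$ can be expressed as

$$\begin{aligned}
S_4' &= \sum_{y=1}^{V} (1 - a_{jy})(1 - b_{iy}) - (1 - a_{ji})(1 - b_{ii}) - (1 - a_{jj})(1 - b_{ij}) + a_{ij}(1 - b_{ij}) \\
&= \sum_{y=1}^{V} (1 - a_{jy})(1 - b_{iy}) - a_{ij} - (1 - b_{ij}) + a_{ij} - a_{ij} b_{ij} \\
&= \sum_{y=1}^{V} (1 - a_{jy})(1 - b_{iy}) + b_{ij} - a_{ij} b_{ij} - 1.
\end{aligned} \tag{77}$$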


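On the one hand, Eqs. 71 and 76 give

$$S_1' + S_3' = \sum_{y=1}^{V} a_{iy}(2 b_{jy} - 1) + V - \sum_{y=1}^{V} b_{jy} + s'_{1,3} \tag{78}$$

with

$$\begin{aligned}
s'_{1,3} &= a_{ij}(1 - b_{ij}) + a_{ij} - a_{ij} b_{ij} - 1 \\
&= 2 a_{ij} - 2 a_{ij} b_{ij} - 1.
\end{aligned} \tag{79}$$

On the other hand, Eqs. 72 and 77 give

$$S_2' + S_4' = \sum_{y=1}^{V} a_{jy}(2 b_{iy} - 1) + V - \sum_{y=1}^{V} b_{iy} + s'_{2,4} \tag{80}$$

with

$$\begin{aligned}
s'_{2,4} &= (1 - a_{ij}) b_{ij} + b_{ij} - a_{ij} b_{ij} - 1 \\
&= 2 b_{ij} - 2 a_{ij} b_{ij} - 1.
\end{aligned} \tag{81}$$

Finally, Eqs. 70, 78 and 80 transform Eq. 64 into

$$\gamma' = \sum_{y=1}^{V} \left[a_{iy}(2 b_{jy} - 1) + a_{jy}(2 b_{iy} - 1)\right] - \sum_{y=1}^{V} (b_{iy} + b_{jy}) + 2(V - a_{ij} b_{ij} - 1) + a_{ij} + b_{ij}. \tag{82}$$

Combining Eqs. 63 and 82 yields, after some algebra,

$$\Delta_{n_c} = \gamma' - \gamma = 2 \sum_{y=1}^{V} \alpha_y \beta_y + 1, \tag{83}$$

where

$$\alpha_y = a_{iy} - a_{jy} \tag{84}$$

$$\beta_y = b_{jy} - b_{iy}. \tag{85}$$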


The final formulae for $\gamma$ (Eq. 63), $\gamma'$ (Eq. 82) and $\Delta_{n_c}$ (Eq. 83) have been verified with the help of computer simulations. For a given $V$, the simulation is based on the following algorithm:

• Setup

1. Generate a vector A of size $V$ containing numbers from 1 to $V$ (in that order).

2. Generate a vector B of size $V$ containing numbers from 1 to $V$ (in that order) if the initial state is $\tau = 1$ and containing those numbers in the reverse order if that state is $\tau = -1$.

3. Calculate $n_c$ and $\gamma$ with A and B.

• Test: run T times

1. Choose uniformly at random two integers $i$ and $j$ such that $1 \leq i, j \leq V$ and $i \neq j$.

2. Swap the $i$-th and the $j$-th element of B.

3. Calculate $n_c'$ and $\gamma'$ with A and the new B.

4. Check that

(a) $\gamma$ coincides with the value provided by Eq. 63

(b) $\gamma'$ coincides with the value provided by Eq. 82

(c) $\Delta_{n_c} = n_c' - n_c$ coincides with the value provided by Eq. 83

5. Update $n_c = n_c'$ and $\gamma = \gamma'$.



The algorithm was run successfully for E = 1 to 100 with T = 10^ and both 
initial states. 
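A minimal sketch of the check of Eq. 83 described by the algorithm above, under stated assumptions: we take $\alpha_y = a_{iy} - a_{jy}$ and $\beta_y = b_{jy} - b_{iy}$, evaluated before the swap, and we only test the formula for $\Delta_{n_c}$ (not the separate formulae for $\gamma$ and $\gamma'$). Function names, the range of $V$ and the number of swaps are ours and much smaller than in the original simulation.

```python
import random


def indicators(values):
    """Matrix m[x][y] = 1 if values[x] < values[y], else 0."""
    return [[int(vx < vy) for vy in values] for vx in values]


def concordant_pairs(p, l):
    """n_c as the sum of c_xy = a_xy * b_xy over all ordered pairs."""
    a, b = indicators(p), indicators(l)
    V = len(p)
    return sum(a[x][y] * b[x][y] for x in range(V) for y in range(V))


def predicted_delta(p, l, i, j):
    """2 * sum_y alpha_y * beta_y + 1, computed before swapping l[i], l[j]."""
    a, b = indicators(p), indicators(l)
    return 2 * sum((a[i][y] - a[j][y]) * (b[j][y] - b[i][y])
                   for y in range(len(p))) + 1


def run_check(V, swaps, reverse, seed=0):
    """Return True if the predicted and observed change in n_c always agree."""
    rng = random.Random(seed)
    p = list(range(1, V + 1))            # vector A of the algorithm
    l = p[::-1] if reverse else p[:]     # vector B (tau = -1 or tau = 1)
    for _ in range(swaps):
        i, j = rng.sample(range(V), 2)   # two distinct positions
        expected = predicted_delta(p, l, i, j)
        before = concordant_pairs(p, l)
        l[i], l[j] = l[j], l[i]
        if concordant_pairs(p, l) - before != expected:
            return False
    return True


all_ok = all(run_check(V, 200, reverse)
             for V in range(2, 9) for reverse in (False, True))
```

Recomputing $n_c$ from scratch after each swap is quadratic in $V$, which is affordable for a check of this size and keeps the comparison independent of the formula being tested.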


B.2.1 $p_i$ and $p_j$ are swapped

Notice that this case is equivalent to the case when $l_i$ and $l_j$ are swapped by symmetry. It suffices to exchange the role of the matrices $A$ and $B$. Now $a_{xy}$ indicates if $l_x < l_y$ and $b_{xy}$ indicates if $p_x < p_y$.


B.2.2 The variation of $\tau$

If ties are missing, $n_d = n_0 - n_c$ and $\tau$ becomes

$$\tau = \frac{2 n_c}{n_0} - 1. \tag{86}$$

Then the discrete derivative of $\tau$ (as a function of the number of swaps) is

$$\Delta_\tau = \tau' - \tau \tag{87}$$

$$= \frac{2(n_c' - n_c)}{n_0} \tag{88}$$

$$= \frac{2 \Delta_{n_c}}{n_0}, \tag{89}$$

where $\Delta_{n_c}$, the discrete derivative of $n_c$, satisfies

$$\Delta_{n_c} = 2 \sum_{y=1}^{V} \alpha_y \beta_y + 1$$

as explained in Section B.1.

Without any loss of generality, suppose that $i < j$. We want to prove that

Statement 1. If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant then $\Delta_{n_c} < 0$, which is equivalent to

$$\sum_{y=1}^{V} \alpha_y \beta_y < -\frac{1}{2}. \tag{90}$$

Statement 2. If $i$ and $j$ are a discordant pair then $\Delta_{n_c} > 0$, which is equivalent to

$$\sum_{y=1}^{V} \alpha_y \beta_y > -\frac{1}{2}. \tag{91}$$

First, we notice some relevant properties of $\alpha$ and $\beta$. Suppose that $p_1 < p_2 < \ldots < p_k < \ldots < p_V$. Then it is easy to see that

Property 1. $\alpha_y = 1$ if $i < y \leq j$ and $\alpha_y = 0$ otherwise.

Now suppose that $l_1 < l_2 < \ldots < l_k < \ldots < l_V$. Then it is easy to see that

Property 2. $\beta_y = -1$ if $i < y \leq j$ and $\beta_y = 0$ otherwise.

If the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant, Properties 1 and 2 indicate that $\alpha_y \beta_y \in \{-1, 0\}$, giving

$$\sum_{y=1}^{V} \alpha_y \beta_y \leq 0. \tag{92}$$

Adding that $\alpha_i \beta_i = -a_{ji} b_{ji} = -1$ or $\alpha_j \beta_j = -a_{ij} b_{ij} = -1$ when the pair $(p_i, l_i)$ and $(p_j, l_j)$ is concordant, it follows that

$$\sum_{y=1}^{V} \alpha_y \beta_y \leq -1, \tag{93}$$

which proves Statement 1. If $i$ and $j$ are discordant, Properties 1 and 2 indicate that $\alpha_y \beta_y \in \{0, 1\}$, giving

$$\sum_{y=1}^{V} \alpha_y \beta_y \geq 0, \tag{94}$$

which proves Statement 2.
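Statements 1 and 2 can be illustrated numerically with a small sketch. It draws hypothetical tie-free values for the probabilities (integers standing in for a ranking) and the lengths, then checks every possible swap of two lengths. The names `kendall_tau` and `swap_effect` are ours.

```python
import random
from itertools import combinations


def kendall_tau(p, l):
    """Kendall tau without ties: tau = 2 * n_c / n_0 - 1."""
    pairs = list(combinations(range(len(p)), 2))
    n_c = sum((p[x] - p[y]) * (l[x] - l[y]) > 0 for x, y in pairs)
    return 2 * n_c / len(pairs) - 1


def swap_effect(p, l, i, j):
    """Return (pair was concordant, change in tau after swapping l[i], l[j])."""
    concordant = (p[i] - p[j]) * (l[i] - l[j]) > 0
    swapped = list(l)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return concordant, kendall_tau(p, swapped) - kendall_tau(p, l)


rng = random.Random(0)
V = 7
p = rng.sample(range(1, 100), V)   # distinct values, so there are no ties
l = rng.sample(range(1, 100), V)
results = [swap_effect(p, l, i, j) for i, j in combinations(range(V), 2)]

# Concordant pairs should lower tau, discordant pairs should raise it.
signs_agree = all((delta < 0) if concordant else (delta > 0)
                  for concordant, delta in results)
```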
B.2.3 The variation of $\Lambda$

The value of $\Lambda$ after one of those swaps will be examined considering two cases. First, imagine that $l_i$ and $l_j$ are swapped. The value of $\Lambda$ after the swap is

$$\Lambda' = \Lambda - (p_i \lambda_i + p_j \lambda_j) + (p_i \lambda_i' + p_j \lambda_j'). \tag{95}$$

Applying $\lambda_i' = \lambda_j$ and $\lambda_j' = \lambda_i$, one obtains

$$\Delta_\Lambda = (p_i - p_j)(g(l_j) - g(l_i)). \tag{96}$$

The fact that $g(l)$ is a monotonically increasing function of $l$ allows one to conclude that

• If the pair was concordant then $\Delta_\Lambda < 0$.

• If the pair was discordant then $\Delta_\Lambda > 0$.

Second, imagine that $p_i$ and $p_j$ are swapped. The value of $\Lambda$ after the swap is

$$\Lambda' = \Lambda - (p_i \lambda_i + p_j \lambda_j) + (p_i' \lambda_i + p_j' \lambda_j). \tag{97}$$

Applying $p_i' = p_j$ and $p_j' = p_i$, one obtains Eq. 96 again. Thus, the same arguments and conclusions apply to the swap of $p_i$ and $p_j$.
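The closed form of Eq. 96 can be compared with a direct recomputation of the cost. The sketch below is only illustrative: the probabilities, the lengths and the choice of $g$ as the identity are hypothetical, and the function names are ours.

```python
import math


def cost(p, l, g):
    """Mean cost: sum of p_i * g(l_i) over all types."""
    return sum(pi * g(li) for pi, li in zip(p, l))


def delta_by_recomputation(p, l, g, i, j):
    """Change in cost after swapping l[i] and l[j], computed from scratch."""
    swapped = list(l)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return cost(p, swapped, g) - cost(p, l, g)


def delta_closed_form(p, l, g, i, j):
    """Eq. 96: (p_i - p_j) * (g(l_j) - g(l_i))."""
    return (p[i] - p[j]) * (g(l[j]) - g(l[i]))


def identity(length):
    return length


p = [0.5, 0.3, 0.2]
# Discordant pair (0, 2): the most probable type is the shortest one.
delta_discordant = delta_by_recomputation(p, [1, 2, 3], identity, 0, 2)
# Concordant pair (0, 2): the most probable type is the longest one.
delta_concordant = delta_by_recomputation(p, [3, 2, 1], identity, 0, 2)
```

Swapping the lengths of a discordant pair increases the cost, while swapping those of a concordant pair decreases it, in agreement with the two bullet points above.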

C The constancy conditions

C.1 Preliminaries: a lower bound for $E[\lambda]$

The optimal coding method presented above allows one to derive a lower bound for

$$E[\lambda] = \frac{1}{V} \sum_{i=1}^{V} \lambda_i. \tag{98}$$

The point is that $E[\lambda]$ can be seen as a particular case of $\Lambda$ with $p_i = 1/V$. Suppose that $l_{max}$ is the maximum string length needed by the optimal coding above. Obviously, $l_{max}$ is the smallest integer such that

$$\sum_{l=1}^{l_{max}} N^l \geq V. \tag{99}$$

Such an optimal coding requires all strings of length smaller than $l_{max}$ and

$$U = V - \sum_{l=1}^{l_{max} - 1} N^l \tag{100}$$

strings of length $l_{max}$. Therefore, Eq. 98 gives

$$E[\lambda] \geq \frac{1}{V} \left[\left(\sum_{l=1}^{l_{max} - 1} N^l g(l)\right) + U g(l_{max})\right]. \tag{101}$$
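The lower bound of Eq. 101 can be sketched by filling the available strings length by length. The function name `lower_bound_mean_magnitude` and the default choice of $g$ as the identity are assumptions made for illustration.

```python
import math


def lower_bound_mean_magnitude(V, N, g=lambda length: length):
    """Lower bound for E[lambda]: use every string of length 1, 2, ...
    until V distinct strings have been assigned."""
    if V < 1 or N < 1:
        raise ValueError("V and N must be positive integers")
    remaining, length, total = V, 0, 0.0
    while remaining > 0:
        length += 1
        used = min(N ** length, remaining)   # strings of this length in use
        total += used * g(length)
        remaining -= used
    return total / V


# Binary alphabet: 2 strings of length 1 and 4 strings of length 2.
bound_six_types = lower_bound_mean_magnitude(6, 2)
# A seventh type needs one string of length 3 (U = 1 in Eq. 100).
bound_seven_types = lower_bound_mean_magnitude(7, 2)
```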


C.2 The constancy of $V$, $E[\lambda]$, $\sigma[\lambda]$ and $\sigma[p]$


The constancy of $V$ could derive from the existence of a core lexicon [59, 60] and social constraints on the addition and propagation of new types [61].

The constancy of $E[\lambda]$ and $\sigma[\lambda]$ could be due to cost-cutting pressures. $V E[\lambda]$ can be regarded as the cost of learning the strings of symbols making up the repertoire and storing them in memory. Suppose that every type is assigned a different string of symbols, i.e. the coding scheme is nonsingular [25]. One could use strings of length $\lceil \log_N V \rceil$ to code for every type but this would be a waste. A lower bound for $E[\lambda]$ is given by Eq. 101. We expect that natural systems are attracted towards this lower bound to minimise the cost of storing the repertoire, providing support for the simplifying assumption that $E[\lambda]$ is constant.

The constancy of $\sigma[\lambda]$ can be supported by the need of intermediate values of $\sigma[\lambda]$:


• A small value of $\sigma[\lambda]$ might be difficult or impossible to achieve. Let us consider that $\sigma[\lambda]$ is minimum, i.e. $\sigma[\lambda] = 0$, which is equivalent to all types having the same magnitude $k$. If $N^k < V$ then some types are not distinguishable (the coding scheme is singular), which is something to avoid. Recall that the unconstrained solution to the minimisation of $\Lambda$ yields $\sigma[\lambda] = 0$ but sacrificing the distinguishability of types (if $V > N$, distinguishability imposes that $\sigma[\lambda]$ is bounded below by a non-zero value). Although it is possible to code any repertoire with strings of length 1 from an alphabet (one only needs that $V = N$), strings of length greater than one have been shown to be evolutionarily advantageous to combat noise [18]. If $V > N$ then the high cost implied by the value of $E[\lambda]$ (as explained above) makes $\sigma[\lambda] = 0$ unlikely. A further reason against $\sigma[\lambda] = 0$ is that, under pressure to minimise $E[\lambda]$ or $\Lambda$ close to the optimal coding for nonsingular codes, all lengths up to $l_{max}$ are taken.


• Let us consider that $\sigma[\lambda]$ is large (much greater than the value of $\sigma[\lambda]$ for nonsingular codes). Then very long strings are expected but this is unnecessarily costly and thus less likely to happen.

Finally, the constancy of $\sigma[p]$ could arise from mechanisms that shape symbol
probabilities independently from $\Lambda$ but that can still involve cost-cutting
factors [201 01] . 


D Optimisation and correlation in community detection

We have shown above an intimate relationship between $\Lambda$ and $\tau$. A straightforward relationship between a function to optimise and correlation is also found in methods for community detection in networks [62]. A central concept in those methods is $Q$, a measure of the quality of a partitioning into communities of a network that must be maximised [63]. Interestingly, $Q$ is intimately related with Fisher's intraclass correlation [64], another correlation coefficient that should not be confused with the popular Pearson interclass correlation $r$ that we have considered above. To see it in detail, suppose that $m$ is the number of edges of a network, $k_i$ is the degree of the $i$-th vertex, $A$ is the adjacency matrix and $c_i$ is the community to which the $i$-th vertex belongs. $Q$ can be defined as [65, p. 224]

$$Q = \frac{1}{2m} \sum_{i,j} \left(A_{ij} - \frac{k_i k_j}{2m}\right) \delta(c_i, c_j), \tag{102}$$


where $\delta(x, y)$ is the Kronecker delta ($\delta(x, y) = 1$ if $x = y$, $\delta(x, y) = 0$ otherwise). The intraclass correlation that is connected with $Q$ is defined between the communities at both ends of an edge. If $x_i$ is a scalar quantity associated to the $i$-th vertex, the intraclass correlation between $x_i$ and $x_j$ over edges is $r_{intra} = cov_{intra}(x_i, x_j)/\sigma_x^2$, where $\sigma_x$ is the standard deviation of the $x_i$'s and $cov_{intra}(x_i, x_j)$ is the intraclass covariance between $x_i$ and $x_j$ over edges, i.e. [65, p. 228]

$$cov_{intra}(x_i, x_j) = \frac{1}{2m} \sum_{i,j} \left(A_{ij} - \frac{k_i k_j}{2m}\right) x_i x_j. \tag{103}$$


The similarity between $Q$ and $cov_{intra}(x_i, x_j)$ is strong: the only difference is that $\delta(c_i, c_j)$ in $Q$ is replaced by $x_i x_j$ in $cov_{intra}(x_i, x_j)$ [65, p. 228]. For the case of only two communities, one may define a variant of $Q$, namely

$$Q' = \frac{1}{2m} \sum_{i,j} \left(A_{ij} - \frac{k_i k_j}{2m}\right) s_i s_j, \tag{104}$$

where $s_i \in \{-1, +1\}$ indicates the community of the $i$-th vertex. Then, $Q' = cov_{intra}(s_i, s_j)$. Thus, the maximisation of $Q'$ is fully equivalent to the maximisation of $r_{intra}$.
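The link between $Q$ and $Q'$ can be illustrated with a small sketch. It assumes the standard modularity definition, $Q = \frac{1}{2m}\sum_{i,j}(A_{ij} - k_i k_j / 2m)\,\delta(c_i, c_j)$, and the variant $Q'$ in which $\delta(c_i, c_j)$ is replaced by $s_i s_j$. The toy network (two triangles joined by one edge) and the helper name `weighted_modularity` are hypothetical. Since $\delta(c_i, c_j) = (s_i s_j + 1)/2$ for two communities and the constant part sums to zero, one expects $Q = Q'/2$ under these definitions.

```python
def weighted_modularity(adj, weight):
    """(1 / 2m) * sum_ij (A_ij - k_i * k_j / 2m) * weight(i, j)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    return sum((adj[i][j] - degree[i] * degree[j] / two_m) * weight(i, j)
               for i in range(n) for j in range(n)) / two_m


# Two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

s = [1, 1, 1, -1, -1, -1]   # community labels, one triangle each

Q = weighted_modularity(adj, lambda i, j: 1.0 if s[i] == s[j] else 0.0)
Q_prime = weighted_modularity(adj, lambda i, j: s[i] * s[j])
```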

E Applications beyond the law of abbreviation 

The results presented in this article go beyond Zipf’s law of abbreviation. 
For instance, the online memory cost of a sentence of n words can be defined 

as [MUM] 

$$D = (n - 1) \sum_{d=1}^{n-1} p(d)\, g(d), \tag{105}$$

where $n - 1$ is the number of edges of a syntactic dependency tree of $n$
vertices, $p(d)$ is the proportion of dependencies of length $d$ and $g(d)$ is the
cognitive cost of a dependency of length $d$. Recently, it has been argued
that $g(d)$ may not be a monotonically increasing function of $d$ as commonly
believed [23]. The minimisation of $D$ can be regarded as a particular case of $\Lambda$
where $V = n - 1$, $p_i$ is $p(d)$ and $g(l_i)$ is $g(d)$, and thus a negative correlation
between $p(d)$ and $g(d)$ is predicted applying the arguments employed in
this article. Finally, knowing that $p(d)$ is a decreasing function of $d$ in real
syntactic dependencies 123 Ell and under the null hypothesis that the words 
of a sentence are arranged linearly at random [22], a positive correlation
between $d$ and $g(d)$ follows.
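A minimal sketch of the online memory cost $D$, assuming a sentence is summarised by the list of its $n - 1$ dependency lengths. The function name, the example lengths and the default choice of $g$ as the identity are hypothetical. With the identity cost, $D$ reduces to the sum of dependency lengths.

```python
import math
from collections import Counter


def online_memory_cost(dependency_lengths, g=lambda d: d):
    """D = (n - 1) * sum_d p(d) * g(d), where p(d) is the proportion of
    dependencies of length d among the n - 1 edges of the tree."""
    n_edges = len(dependency_lengths)
    if n_edges == 0:
        return 0.0
    counts = Counter(dependency_lengths)
    return n_edges * sum((count / n_edges) * g(d)
                         for d, count in counts.items())


lengths = [1, 1, 2, 3]          # hypothetical sentence of n = 5 words
D_identity = online_memory_cost(lengths)
D_quadratic = online_memory_cost(lengths, g=lambda d: d ** 2)
```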